Random Forest

Kristen Monaco, Praya Cheekapara, Raymond Fleming, Teng Ma

Random Forest Overview

  • Ensemble machine learning method based on a large number of decision trees voting to predict a classification
  • Benefits compared to decision tree:
    • Able to function with incomplete data
    • Lower likelihood of overfitting
    • Improved prediction accuracy

Bootstrap Sampling (Bagging)

  • Each decision tree uses a random sample of the original dataset
    • Using a subset of the dataset reduces the probability of an overfit model
    • Rows with missing data may be left out of an individual tree's sample, limiting their influence on the ensemble
    • Performed with replacement
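A minimal sketch of bagging on a toy dataset of row indices (the dataset and sizes are illustrative, not from the South African Red List):

```python
import random

random.seed(0)
rows = list(range(10))  # toy dataset: 10 row indices

def bootstrap_sample(data):
    """Draw len(data) rows with replacement (bagging)."""
    return random.choices(data, k=len(data))

sample = bootstrap_sample(rows)
# Sampling with replacement means some rows repeat while others
# are left out of this tree's sample ("out-of-bag" rows)
out_of_bag = set(rows) - set(sample)
```

Each tree in the forest receives its own such sample, which decorrelates the trees.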

Random Feature Selection

  • A random set of features is selected for each node in training
    • Information about feature importance may be saved and applied in later iterations
    • Even with automated random feature selection, feature selection and engineering prior to training may improve performance
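The per-node feature subsampling above can be sketched as follows; the feature names are hypothetical, and the sqrt(p) subset size is a common default rather than a requirement:

```python
import math
import random

random.seed(1)
# Hypothetical plant features (for illustration only)
features = ["habitat", "range", "growth_form", "pollinator", "altitude"]

def features_for_split(all_features):
    """Pick a random subset of features to consider at one tree node."""
    # Common default for classification: about sqrt(p) features per split
    m = max(1, round(math.sqrt(len(all_features))))
    return random.sample(all_features, m)

subset = features_for_split(features)
```

A fresh subset is drawn at every node, so different trees (and different nodes) split on different features.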

Cross Validation

  • Validates model performance
    • Resampling method similar to bootstrapping, but without replacement
    • Approximates how well the model will generalize to unseen data
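A sketch of the k-fold partition this describes, using toy indices; unlike bootstrapping, every row appears exactly once across the folds (sampling without replacement):

```python
import random

random.seed(2)
indices = list(range(12))  # toy row indices
random.shuffle(indices)

def k_folds(idx, k=3):
    """Partition shuffled indices into k disjoint folds (no replacement)."""
    return [idx[i::k] for i in range(k)]

folds = k_folds(indices)
# Each fold serves once as the held-out validation set;
# the remaining rows form the training set for that round
splits = [(sorted(set(indices) - set(fold)), fold) for fold in folds]
```

Averaging the validation score over the k rounds gives the performance estimate.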

Prediction

  • Each trained decision tree produces its own prediction
    • Decision trees are independent, having been trained on different subsets of both rows and features

Ensemble Voting

  • The results from each decision tree are combined into a voting classifier
    • The mode of the classification results will be the final prediction
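The two steps above (per-tree prediction, then majority vote) can be sketched with hypothetical class labels:

```python
from statistics import mode

# Hypothetical predictions from five independent trees for one sample
tree_predictions = ["threatened", "not threatened", "threatened",
                    "threatened", "not threatened"]

# The mode of the per-tree predictions is the ensemble's final answer
final_prediction = mode(tree_predictions)
```

With "threatened" predicted by three of the five trees, it wins the vote.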

Dataset

  • South African Red List
    • Data about plants with their habitat, traits, distribution, and factors influencing their current threatened/extinct status
  • Purpose
    • Predict whether or not an unknown plant is threatened based on the above characteristics

Visuals 1

  • Distribution Range

Visuals 2

  • Cramer’s V Association

Analysis

  • Five separate random forest models were created, each using a different normalization method

Data Preparation

  • Preprocessing
    • Encode categorical features into numerical / factor features
    • Split the dataset into training and test sets, stratifying to avoid class imbalance
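A sketch of both preprocessing steps on toy rows (the habitat values, labels, and 75/25 split ratio are assumptions for illustration):

```python
import random
from collections import defaultdict

random.seed(3)
# Hypothetical labelled rows: (habitat value, class label)
rows = ([("fynbos", "threatened")] * 6
        + [("grassland", "not threatened")] * 6)

# Encode the categorical feature as integer codes
levels = sorted({habitat for habitat, _ in rows})
encoding = {v: i for i, v in enumerate(levels)}
encoded = [(encoding[h], label) for h, label in rows]

# Stratified split: sample within each class so the
# train/test label proportions match the full dataset
by_class = defaultdict(list)
for row in encoded:
    by_class[row[1]].append(row)

train, test = [], []
for cls_rows in by_class.values():
    random.shuffle(cls_rows)
    cut = int(0.75 * len(cls_rows))
    train += cls_rows[:cut]
    test += cls_rows[cut:]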

Preprocessing

  • Class Imbalance
    • Resample (oversample) the smaller classes so the class sizes are approximately equal
    • Training on an imbalanced dataset biases predictions toward the larger class
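The resampling step can be sketched as simple oversampling with replacement; the class sizes below are illustrative:

```python
import random

random.seed(4)
# Hypothetical imbalanced label lists
majority = ["not threatened"] * 8
minority = ["threatened"] * 3

# Oversample the minority class with replacement until the sizes match
extra = random.choices(minority, k=len(majority) - len(minority))
balanced = majority + minority + extra
```

After resampling, neither class dominates the trees' training data.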

Normalization

  • Apply 5 normalization techniques to both training and test datasets
    • Min-Max
    • Z-Score
    • Max Absolute Value
    • L1 Norm
    • L2 Norm
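The five techniques applied to a single toy feature column (values chosen only to keep the arithmetic simple; a real pipeline would fit each transform on the training set and reuse its parameters on the test set):

```python
x = [2.0, 4.0, 4.0, 4.0, 6.0]  # toy feature column

mean = sum(x) / len(x)
std = (sum((v - mean) ** 2 for v in x) / len(x)) ** 0.5  # population std

min_max = [(v - min(x)) / (max(x) - min(x)) for v in x]   # rescale to [0, 1]
z_score = [(v - mean) / std for v in x]                   # mean 0, std 1
max_abs = [v / max(abs(v) for v in x) for v in x]         # divide by max |value|
l1_norm = [v / sum(abs(v) for v in x) for v in x]         # |values| sum to 1
l2_norm = [v / sum(v * v for v in x) ** 0.5 for v in x]   # unit Euclidean length
```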

Prediction

  • Combine results into a vector
  • Identify the most frequently predicted class
  • Iterate over entire test set, storing results
  • Generate a confusion matrix and calculate the sensitivity and precision for each category
  • Iterate after tuning if necessary
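The evaluation step can be sketched as follows; the actual/predicted labels are hypothetical stand-ins for the stored test-set results ("t" = threatened, "n" = not threatened):

```python
from collections import Counter

# Hypothetical ground truth vs. model predictions on the test set
actual    = ["t", "t", "t", "n", "n", "n", "t", "n"]
predicted = ["t", "n", "t", "n", "n", "t", "t", "n"]

# Confusion matrix as counts of (actual, predicted) pairs
confusion = Counter(zip(actual, predicted))

def sensitivity(cls):
    """Correct predictions / all rows actually in the class (recall)."""
    correct = confusion[(cls, cls)]
    total = sum(v for (a, _), v in confusion.items() if a == cls)
    return correct / total

def precision(cls):
    """Correct predictions / all rows predicted as the class."""
    correct = confusion[(cls, cls)]
    total = sum(v for (_, p), v in confusion.items() if p == cls)
    return correct / total
```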

Results

  • Range was found to be the strongest predictor of extinction
  • Habitat loss was the second strongest predictor of extinction